4/07/22 Did not work much today... will make up for it tonight.

Look into time encodings (time2vec, etc.). Want to make the model extensible to a variable number of forecast steps. The naive method is to just decode autoregressively, but ideally we avoid sequential autoregression, which accumulates and propagates error, and predict all steps directly or something similar. For that to happen we need a robust timestep encoding. On top of that we would need a longer context length, i.e. 7 + the number of future forecasts. This can increase computation time a ton, so we have to look into this or come up with a better idea. We’re also going to need to compute seasonal encodings on a rolling basis for each forecasted value...

Just realized we might need to amend the null mask. Right now every position ignores the null except the null itself, which essentially reduces to removing the null from the sequence and computing everything else with the appropriate positional encoding. The only benefit of the null mask as it is right now is that the decoder can output a value for a null timestep... but in practice nulls don’t show up in the decoder during inference, so that trait is pretty much useless. So we might as well just remove the nulls from the sequence instead of wasting computation on them?

But there is a benefit to the null mask. If we remove the decoder-encoder null mask, then the null encoding in the encoder could actually be useful: it integrates information from previous steps as well as its timestep encoding, so even though its value is null, it contains useful info that the decoder could use. So in short, we should try removing the dec-enc null mask. And possibly also use separate encoder null masks depending on the layer: the first layer gets a slightly different encoder mask, under which the na position doesn’t use its own interpolated input value, while in the following layers the na position can use its own encoding from the previous layer.
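Rough sketch of the two masks discussed above, to make the idea concrete. Assumptions: masks are boolean with True meaning "query may attend to key", no causal mask is included, and the function names `encoder_null_mask` and `cross_attention_mask` are made up for illustration (they are not from the actual model code).

```python
import numpy as np

def encoder_null_mask(is_null):
    """Encoder self-attention mask; allowed[i, j] is True if query i may attend to key j.

    A null key is hidden from every query except the null position itself.
    """
    is_null = np.asarray(is_null, dtype=bool)
    n = is_null.shape[0]
    # Every row is the same: observed keys are visible, null keys are hidden.
    allowed = np.broadcast_to(~is_null, (n, n)).copy()
    # Each position, null or not, can still attend to itself.
    allowed[np.diag_indices(n)] = True
    return allowed

def cross_attention_mask(n_dec, is_null, mask_nulls=True):
    """Decoder-encoder mask of shape (n_dec, n_enc).

    mask_nulls=True  : decoder ignores null encoder positions (current behaviour).
    mask_nulls=False : dec-enc null mask removed, decoder sees every encoder position.
    """
    is_null = np.asarray(is_null, dtype=bool)
    visible = ~is_null if mask_nulls else np.ones_like(is_null)
    return np.broadcast_to(visible, (n_dec, is_null.shape[0])).copy()

# Hypothetical 4-step encoder input with a null at position 1.
is_null = [False, True, False, False]
enc_mask = encoder_null_mask(is_null)
cross_masked = cross_attention_mask(2, is_null, mask_nulls=True)
cross_open = cross_attention_mask(2, is_null, mask_nulls=False)
```

With `mask_nulls=False` the encoder mask stays the same, so the null still cannot leak into the other encoder positions; only the decoder gets to read the null position's encoding. A layer-dependent variant would additionally clear the diagonal entry of the null row in the first layer only.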
The problem with the layer-dependent mask is (a) if an na appears in the first layer, we get the same nan problem as before, and (b) we don’t consider timestep information if we throw out the first layer’s input. But then we are effectively interpolating the na value as 0... the good thing is that this will only affect the null timestep and nothing else. But then that almost defeats the purpose of having the null mask... We should definitely try out different methods here, but so far it seems best to just remove the na values from the sequence after adding the appropriate timestep encodings, which has the same effect as ignoring the nulls.

Or, best of both worlds: to avoid the heavy bias of setting the null position to 0, we interpolate it (average of the observed values on either side, or carry forward if there is no next observation or there are multiple nulls in a row), and we also use the idea of not having a dec-enc null mask. That way we get an approximate encoding for the null position which has a bias, but not as bad a bias, and we limit its propagation to the decoder only.

Think about the multivariate case, i.e. several series with one important one and the rest exogenous.
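Quick sketch of the interpolation rule from the "best of both worlds" idea, just to pin down the edge cases. Assumptions: nulls are stored as NaN in a 1-D array, "multiple in a row" means we carry forward the last filled value, and a leading null with nothing before it is left as NaN. The name `fill_nulls` is made up for illustration.

```python
import numpy as np

def fill_nulls(values):
    """Fill NaNs in a 1-D series.

    Isolated gap with an observation on both sides: average of the two neighbours.
    Otherwise (run of nulls, or no next observation): carry the previous value forward.
    """
    x = np.asarray(values, dtype=float)
    out = x.copy()
    n = len(x)
    for t in range(n):
        if not np.isnan(x[t]):
            continue  # observed value, nothing to do
        prev_observed = t > 0 and not np.isnan(x[t - 1])
        next_observed = t + 1 < n and not np.isnan(x[t + 1])
        if prev_observed and next_observed:
            out[t] = 0.5 * (x[t - 1] + x[t + 1])  # average of neighbours
        elif t > 0 and not np.isnan(out[t - 1]):
            out[t] = out[t - 1]  # carry forward (possibly an already filled value)
        # else: leading null, nothing to carry forward, stays NaN
    return out

# Hypothetical series: one isolated gap, one run of two nulls, one trailing null.
series = [1.0, np.nan, 3.0, np.nan, np.nan, 6.0, np.nan]
filled = fill_nulls(series)
was_null = np.isnan(np.asarray(series))  # keep this to build the encoder null mask
```

The filled values only replace the 0 input at the null positions. The `was_null` flags would still drive the encoder null mask, so the approximate values stay out of the other encoder positions and only reach the decoder once the dec-enc null mask is dropped.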